Abstract

We report on a method for compiling 
decision trees into weighted finite-state 
transducers. The key assumptions are 
that the tree predictions specify how to 
rewrite symbols from an input string, 
and the decision at each tree node is 
stateable in terms of regular expressions 
on the input string. Each leaf node 
can then be treated as a separate rule 
where the left and right contexts are 
constructable from the decisions made 
traversing the tree from the root to the 
leaf. These rules are compiled into transducers
using the weighted rewrite-rule
compilation algorithm described in
(Mohri and Sproat, 1996).
1 Introduction



Much attention has been devoted recently to 
methods for inferring linguistic models from data. 
One powerful inference method that has been
used in various applications is the decision tree,
and in particular the classification and regression tree
(Breiman et al., 1984).

An increasing amount of attention has also 
been focussed on finite-state methods for imple- 
menting linguistic models, in particular finite- 
state transducers and weighted finite-state trans- 
ducers; see (Kaplan and Kay, 1994; Pereira et al., 
1994, inter alia). The reason for the renewed in- 
terest in finite-state mechanisms is clear. Finite- 
state machines provide a mathematically well- 
understood computational framework for repre- 
senting a wide variety of information, both in NLP 
and speech processing. Lexicons, phonological 
rules, Hidden Markov Models, and (regular) gram- 
mars are all representable as finite-state machines, 
and finite-state operations such as union, intersec- 
tion and composition mean that information from 
these various sources can be combined in useful
and computationally attractive ways. The reader
is referred to the above-cited papers (among oth- 
ers) for more extensive justification. 

This paper reports on a marriage of these two 
strands of research in the form of an algorithm for 
compiling the information in decision trees into 
weighted finite-state transducers. 1 Given this al- 
gorithm, information inferred from data and rep- 
resented in a tree can be used directly in a system 
that represents other information, such as lexicons 
or grammars, in the form of finite-state machines. 

2 Quick Review of Tree-Based 
Modeling 

A general introduction to classification and regres- 
sion trees ('CART') including the algorithm for 
growing trees from data can be found in (Breiman 
et al., 1984). Applications of tree-based modeling 
to problems in speech and NLP are discussed in 
(Riley, 1989; Riley, 1991; Wang and Hirschberg, 
1992; Magerman, 1995, inter alia). In this section 
we presume that one has already trained a tree 
or set of trees, and we merely remind the reader 
of the salient points in the interpretation of those 
trees. 

Consider the tree depicted in Figure 1, which 
was trained on the TIMIT database (Fisher et al., 
1987), and which models the phonetic realization 
of the English phoneme /aa/ (/ɑ/) in various
environments (Riley, 1991). When this tree is used
in predicting the allophonic form of a particular 
instance of /aa/, one starts at the root of the 
tree, and asks questions about the environment 
in which the /aa/ is found. Each non-leaf node n
dominates two daughter nodes conventionally labeled
as 2n and 2n + 1; the decision on whether to
go left to 2n or right to 2n + 1 depends on the an- 
swer to the question that is being asked at node n. 
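The node-numbering convention can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the dictionary `QUESTIONS`, the function `descend`, and the environment keys are hypothetical names, and only the two questions described in the walk-through that follows (at the root and at its left daughter) are encoded.

```python
# Questions at internal nodes, keyed by node number (hypothetical encoding).
# True means "go left" (to node 2n); False means "go right" (to node 2n + 1).
QUESTIONS = {
    1: lambda env: env["lseg"] == 1,     # is /aa/ the first segment of the word?
    2: lambda env: env["cp1"] == "alv",  # is the next segment an alveolar consonant?
}


def descend(env, node=1):
    """Return the node numbers visited from the root until no question is known."""
    path = [node]
    while node in QUESTIONS:
        node = 2 * node if QUESTIONS[node](env) else 2 * node + 1
        path.append(node)
    return path


# Word-initial /aa/ followed by an alveolar such as /z/.
print(descend({"lseg": 1, "cp1": "alv"}))  # [1, 2, 4]
```

The descent stops as soon as it reaches a node with no recorded question, so nodes whose questions are not described in the text (node 3, for instance) simply end the path here.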



1 The work reported here can thus be seen as
complementary to recent reports on methods for directly
inferring transducers from data (Oncina et al., 1993; 
Gildea and Jurafsky, 1995). 



A concrete example will serve to illustrate. Con- 
sider that we have /aa/ in some environment. The 
first question that is asked concerns the number of 
segments, including the /aa/ itself, that occur to 
the left of the /aa/ in the word in which /aa/ oc- 
curs. (See Table 1 for an explanation of the sym- 
bols used in Figure 1.) In this case, if the /aa/ 
is initial (i.e., lseg is 1), one goes left; if there
is one or more segments to the left in the word, 
go right. Let us assume that this /aa/ is initial 
in the word, in which case we go left. The next 
question concerns the consonantal 'place' of artic- 
ulation of the segment to the right of /aa/; if it 
is alveolar go left; otherwise, if it is of some other 
quality, or if the segment to the right of /aa/ is not 
a consonant, then go right. Let us assume that the 
segment to the right is /z/, which is alveolar, so we 
go left. This lands us at terminal node 4. The tree 
in Figure 1 shows us that in the training data 119 
out of 308 occurrences of /aa/ in this environment 
were realized as [ao], or in other words that we can
estimate the probability of /aa/ being realized as 
[ao] in this environment as .385. The full set of 
realizations at this node with estimated non-zero 
probabilities is as follows (see Table 2 for a rele- 
vant set of ARPABET-IPA correspondences): 



phone    probability    -log prob. (weight)
ao       0.385          0.95
aa       0.289          1.24
q+aa     0.103          2.27
q+ao     0.096          2.34
ah       0.069          2.68
ax       0.058          2.84
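As a quick check on the weight column, the following sketch recomputes each weight as the negative natural logarithm of the listed probability. The variable names are illustrative. Because the listed probabilities are already rounded to three places, the last digit of a recomputed weight can differ from the table by 0.01.

```python
import math

# Estimated probabilities at leaf node 4, copied from the table above.
probs = {
    "ao": 0.385, "aa": 0.289, "q+aa": 0.103,
    "q+ao": 0.096, "ah": 0.069, "ax": 0.058,
}

# A weight is the negative natural log of the probability, so multiplying
# probabilities along a path corresponds to adding weights.
weights = {phone: -math.log(p) for phone, p in probs.items()}

for phone, w in weights.items():
    print(f"{phone:5s} {probs[phone]:.3f} {w:.2f}")
```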



An important point to bear in mind is that a 
decision tree in general is a complete description, 
in the sense that for any new data point, there 
will be some leaf node that corresponds to it. So 
for the tree in Figure 1, each novel instance
of /aa/ will be handled by (exactly) one leaf node 
in the tree, depending upon the environment in 
which the /aa/ finds itself. 

Another important point is that each deci- 
sion tree considered here has the property that 
its predictions specify how to rewrite a symbol (in 
context) in an input string. In particular, they 
specify a two-level mapping from a set of input 
symbols (phonemes) to a set of output symbols 
(allophones). 

3 Quick Review of Rule 
Compilation 

Work on finite-state phonology (Johnson, 1972; 
Koskenniemi, 1983; Kaplan and Kay, 1994) has 
shown that systems of rewrite rules of the familiar
form $\phi \rightarrow \psi / \lambda \_\_ \rho$, where $\phi$, $\psi$, $\lambda$ and $\rho$ are
regular expressions, can be represented computationally
as finite-state transducers (FSTs): note
that $\phi$ represents the rule's input, $\psi$ the output,
and $\lambda$ and $\rho$, respectively, the left and right
contexts.

Kaplan and Kay (1994) have presented a con- 
crete algorithm for compiling systems of such 
rules into FSTs. These methods can be ex- 
tended slightly to include the compilation of prob- 
abilistic or weighted rules into weighted finite- 
state-transducers (WFSTs — see (Pereira et al., 
1994)): Mohri and Sproat (1996) describe a rule- 
compilation algorithm which is more efficient than 
the Kaplan-Kay algorithm, and which has been 
extended to handle weighted rules. For present 
purposes it is sufficient to observe that given this 
extended algorithm, we can allow $\psi$ in the expression
$\phi \rightarrow \psi / \lambda \_\_ \rho$ to represent a weighted regular
expression. The compiled transducer corresponding
to that rule will replace $\phi$ with $\psi$ with
the appropriate weights in the context $\lambda \_\_ \rho$.

4 The Tree Compilation Algorithm 

The key requirements on the kind of decision trees 
that we can compile into WFSTs are (1) the pre- 
dictions at the leaf nodes specify how to rewrite 
a particular symbol in an input string, and (2) 
the decisions at each node are stateable as regu- 
lar expressions over the input string. Each leaf 
node represents a single rule. The regular expres- 
sions for each branch describe one aspect of the 
left context $\lambda$, right context $\rho$, or both. The left
and right contexts for the rule consist of the intersections
of the partial descriptions of these contexts
defined for each branch traversed between
the root and leaf node. The input $\phi$ is predefined
for the entire tree, whereas the output $\psi$ is
defined as the union of the set of outputs, along 
with their weights, that are associated with the 
leaf node. The weighted rule belonging to the leaf 
node can then be compiled into a transducer us- 
ing the weighted-rule-compilation algorithm refer- 
enced in the preceding section. The transducer for 
the entire tree can be derived by the intersection 
of the entire set of transducers associated with the 
leaf nodes. Note that while regular relations are 
not generally closed under intersection, the subset 
of same-length (or more strictly speaking length- 
preserving) relations is closed; see below. 

To see how this works, let us return to the ex- 
ample in Figure 1. To start with, we know that 
this tree models the phonetic realization of /aa/, 
so we can immediately set $\phi$ to be aa for the whole
tree. Next, consider again the traversal of the tree 
from the root node to leaf node 4. The first deci- 
sion concerns the number of segments to the left 
of the /aa/ in the word, either none for the left 




[Figure 1: tree diagram lost in extraction. Surviving branch labels: cp1:alv; cp1:blab,labd,den,pal,vel,pha,n/a; vp-1:fl,fml,fmh,fh,cml,bmh,n/a. Surviving leaf labels, in order: node 10: aa (110/349); node 11: q+aa (69/128); node 14: aa (2080/2080); node 15: aa (415/439).]

Figure 1: Tree modeling the phonetic realization of /aa/. All phones are given in ARPABET. Table 2 gives
ARPABET-IPA conversions for symbols relevant to this example. See Table 1 for an explanation of other
symbols.



cpn     place of articulation of consonant n segments to the right
cp-n    place of articulation of consonant n segments to the left
        values: alveolar; bilabial; labiodental; dental; palatal; velar; pharyngeal;
        n/a if it is a vowel, or there is no such segment

vpn     place of articulation of vowel n segments to the right
vp-n    place of articulation of vowel n segments to the left
        values: central-mid-high; back-low; back-mid-low; back-high; front-low;
        front-mid-low; front-mid-high; front-high; central-mid-low; back-mid-high
        n/a if it is a consonant, or there is no such segment

lseg    number of preceding segments including the segment of interest within the word
rseg    number of following segments including the segment of interest within the word
        values: 1, 2, 3, many

str     stress assigned to this vowel
        values: primary, secondary, no (zero) stress
        n/a if there is no stress mark

Table 1: Explanation of symbols in Figure 1.



ARPABET    IPA
aa         ɑ
ao         ɔ
ax         ə
ah         ʌ
q+aa       ʔɑ
q+ao       ʔɔ

Table 2: ARPABET-IPA conversion for symbols relevant for Figure 1.



branch, or one or more for the right branch. Assuming
that we have a symbol $\sigma$ representing a
single segment, the symbol # representing a word
boundary, and allowing for the possibility of intervening
optional stress marks ('), which do
not count as segments, these two possibilities can
be represented by the regular expressions for $\lambda$ in
(a) of Table 3. 2 At this node there is no decision
based on the righthand context, so the righthand
context is free. We can represent this by
setting $\rho$ at this node to be $\Sigma^*$, where $\Sigma$
(conventionally) represents the entire alphabet: note
that the alphabet is defined to be an alphabet of
all $\phi$:$\psi$ correspondence pairs that were determined
empirically to be possible.

The decision at the left daughter of the root 
node concerns whether or not the segment to the 
right is an alveolar. Assuming we have defined 
classes of segments alv, blab, and so forth (repre- 
sented as unions of segments) we can represent the 
regular expression for $\rho$ as in (b) of Table 3. In
this case it is $\lambda$ which is unrestricted, so we can
set that at $\Sigma^*$.

We can derive the $\lambda$ and $\rho$ expressions for
the rule at leaf node 4 by intersecting together
the expressions for these contexts defined for each
branch traversed on the way to the leaf. For
leaf node 4, $\lambda = \#\mathrm{Opt}(') \cap \Sigma^* = \#\mathrm{Opt}(')$, and
$\rho = \Sigma^* \cap \mathrm{Opt}(')(\mathrm{alv}) = \mathrm{Opt}(')(\mathrm{alv})$. 3 The rule
input $\phi$ has already been given as aa. The output
$\psi$ is defined as the union of all of the possible expressions
(at the leaf node in question) that aa
could become, with their associated weights (negative
log probabilities), which we represent here as
subscripted floating-point numbers:

$$\psi = \mathrm{ao}_{0.95} \cup \mathrm{aa}_{1.24} \cup \mathrm{q{+}aa}_{2.27} \cup \mathrm{q{+}ao}_{2.34} \cup \mathrm{ah}_{2.68} \cup \mathrm{ax}_{2.84}$$

Thus the entire weighted rule can be written as
follows:

$$\mathrm{aa} \rightarrow (\mathrm{ao}_{0.95} \cup \mathrm{aa}_{1.24} \cup \mathrm{q{+}aa}_{2.27} \cup \mathrm{q{+}ao}_{2.34} \cup \mathrm{ah}_{2.68} \cup \mathrm{ax}_{2.84}) \,/\, \#\mathrm{Opt}(') \;\_\_\; \mathrm{Opt}(')(\mathrm{alv})$$

2 As far as possible, we use the notation of Kaplan
and Kay (1994).

3 Strictly speaking, since the $\lambda$s and $\rho$s at each
branch may define expressions of different lengths, it
is necessary to left-pad each $\lambda$ with $\Sigma^*$, and right-pad
each $\rho$ with $\Sigma^*$. We gloss over this point here in order
to make the regular expressions somewhat simpler to
understand.

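By a similar construction, the rule at node 6, for
example, would be represented as:

$$\mathrm{aa} \rightarrow (\mathrm{aa}_{0.40} \cup \mathrm{ao}_{1.11}) \,/\, (\#(\mathrm{Opt}(')\sigma)^{+}\mathrm{Opt}(')) \cap (\Sigma^*((\mathrm{cmh}) \cup (\mathrm{bl}) \cup (\mathrm{bml}) \cup (\mathrm{bh}))) \;\_\_\; \Sigma^*$$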

Each node thus represents a rule which states
that a mapping occurs between the input symbol
$\phi$ and the weighted expression $\psi$ in the condition
described by $\lambda \_\_ \rho$. Now, in cases where $\phi$ finds
itself in a context that is not subsumed by $\lambda \_\_ \rho$,
the rule behaves exactly as a two-level surface coercion
rule (Koskenniemi, 1983): it freely allows
$\phi$ to correspond to any $\psi$ as specified by the alphabet
of pairs. These $\phi$:$\psi$ correspondences are,
however, constrained by other rules derived from
the tree, as we shall see directly.

The interpretation of the full tree is that it
represents the conjunction of all such mappings:
for rules 1, 2, ..., n, $\phi$ corresponds to $\psi_1$ given
condition $\lambda_1 \_\_ \rho_1$ and $\phi$ corresponds to $\psi_2$ given
condition $\lambda_2 \_\_ \rho_2$ ... and $\phi$ corresponds to $\psi_n$
given condition $\lambda_n \_\_ \rho_n$. But this conjunction is
simply the intersection of the entire set of transducers
defined for the leaves of the tree. Observe
now that the $\phi$:$\psi$ correspondences that were left
free by the rule of one leaf node are constrained
by intersection with the other leaf nodes: since, as
noted above, the tree is a complete description, it
follows that for any leaf node i, and for any context
$\lambda \_\_ \rho$ not subsumed by $\lambda_i \_\_ \rho_i$, there is some
leaf node j such that $\lambda_j \_\_ \rho_j$ subsumes $\lambda \_\_ \rho$.

(a)  left branch     $\lambda = \#\mathrm{Opt}(')$
                     $\rho = \Sigma^*$

     right branch    $\lambda = (\#\mathrm{Opt}(')\sigma\mathrm{Opt}(')) \cup (\#\mathrm{Opt}(')\sigma\mathrm{Opt}(')\sigma\mathrm{Opt}(')) \cup (\#\mathrm{Opt}(')\sigma\mathrm{Opt}(')\sigma\mathrm{Opt}(')(\sigma\mathrm{Opt}('))^{+})$
                     $\phantom{\lambda} = \#(\mathrm{Opt}(')\sigma)^{+}\mathrm{Opt}(')$

(b)  left branch     $\lambda = \Sigma^*$
                     $\rho = \mathrm{Opt}(')(\mathrm{alv})$

     right branch    $\lambda = \Sigma^*$
                     $\rho = (\mathrm{Opt}(')(\mathrm{blab})) \cup (\mathrm{Opt}(')(\mathrm{labd})) \cup (\mathrm{Opt}(')(\mathrm{den})) \cup (\mathrm{Opt}(')(\mathrm{pal})) \cup (\mathrm{Opt}(')(\mathrm{vel})) \cup (\mathrm{Opt}(')(\mathrm{pha})) \cup (\mathrm{Opt}(')(\mathrm{n/a}))$

Table 3: Regular-expression interpretation of the decisions involved in going from the root node to leaf node
4 in the tree in Figure 1. Note that, as per convention, superscript '+' denotes one or more instances of an
expression.

Thus, the transducers compiled for the rules at
nodes 4 and 6 are intersected together, along with
the rules for all the other leaf nodes. Now, as
noted above, and as discussed by Kaplan and Kay
(1994), regular relations (the algebraic counterpart
of FSTs) are not in general closed under
intersection; however, the subset of same-length
regular relations is closed under intersection, since
they can be thought of as finite-state acceptors expressed
over pairs of symbols. 4 This point can
be extended somewhat to include relations that
involve bounded deletions or insertions: this is precisely
the interpretation necessary for systems of
two-level rules (Koskenniemi, 1983), where a single
transducer expressing the entire system may
be constructed via intersection of the transducers
expressing the individual rules (Kaplan and
Kay, 1994, pages 367-376). Indeed, our decision
tree represents neither more nor less than a set of
weighted two-level rules. Each of the symbols in
the expressions for $\lambda$ and $\rho$ actually represents (sets
of) pairs of symbols: thus alv, for example, represents
all lexical alveolars paired with all of their
possible surface realizations. And just as each tree
represents a system of weighted two-level rules, so
a set of trees (e.g., where each tree deals with
the realization of a particular phone) represents
a system of weighted two-level rules, where each
two-level rule is compiled from each of the individual
trees.
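The closure argument rests on treating a same-length transducer as an acceptor over symbol pairs, so that intersection becomes the ordinary product construction (spelled out in footnote 4). The sketch below illustrates that construction for unweighted, deterministic machines only. The class `PairAcceptor`, the function `intersect`, and the two toy machines are hypothetical and are not taken from the paper or from any FST library.

```python
from dataclasses import dataclass


@dataclass
class PairAcceptor:
    """Deterministic acceptor whose arc labels are (input, output) pairs."""
    start: object
    finals: set
    delta: dict  # maps (state, (input, output)) to the next state

    def accepts(self, pairs):
        state = self.start
        for pair in pairs:
            state = self.delta.get((state, pair))
            if state is None:
                return False
        return state in self.finals


def intersect(m1, m2):
    """Product construction: follow an arc only if both machines allow the pair."""
    start = (m1.start, m2.start)
    delta, finals = {}, set()
    todo, seen = [start], {start}
    while todo:
        q1, q2 = todo.pop()
        if q1 in m1.finals and q2 in m2.finals:
            finals.add((q1, q2))
        for (src, pair), r1 in m1.delta.items():
            if src != q1:
                continue
            r2 = m2.delta.get((q2, pair))
            if r2 is None:
                continue
            nxt = (r1, r2)
            delta[((q1, q2), pair)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return PairAcceptor(start, finals, delta)


# Two toy one-state machines that disagree on how aa may be realized.
m1 = PairAcceptor(0, {0}, {(0, ("aa", "ao")): 0, (0, ("aa", "aa")): 0, (0, ("z", "z")): 0})
m2 = PairAcceptor(0, {0}, {(0, ("aa", "ao")): 0, (0, ("aa", "ah")): 0, (0, ("z", "z")): 0})
both = intersect(m1, m2)
print(both.accepts([("aa", "ao"), ("z", "z")]))  # True
print(both.accepts([("aa", "aa"), ("z", "z")]))  # False
```

Weights are deliberately left out of this sketch; how the weighted case combines arc weights is not something the text above specifies.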

4 One can thus define intersection for transducers
analogously with intersection for acceptors. Given
two machines $G_1$ and $G_2$, with transition functions
$\delta_1$ and $\delta_2$, one can define the transition function
of $G$, $\delta$, as follows: for an input-output pair $(i, o)$,
$\delta((q_1, q_2), (i, o)) = (q_1', q_2')$ if and only if
$\delta_1(q_1, (i, o)) = q_1'$ and $\delta_2(q_2, (i, o)) = q_2'$.

We can summarize this discussion more formally
as follows. We presume a function Compile
which given a rule returns the WFST computing
that rule. The WFST for a single leaf L is thus
defined as follows, where $\phi_T$ is the input symbol
for the entire tree, $\psi_L$ is the output expression defined
at L, $P_L$ represents the path traversed from
the root node to L, p is an individual branch on
that path, and $\lambda_p$ and $\rho_p$ are the expressions for
$\lambda$ and $\rho$ defined at p:

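$$\mathit{Rule}_L = \mathit{Compile}\Bigl(\phi_T \rightarrow \psi_L \,/\, \bigcap_{p \in P_L} \lambda_p \;\_\_\; \bigcap_{p \in P_L} \rho_p\Bigr)$$

The transducer for an entire tree T is defined as:

$$\mathit{Rule}_T = \bigcap_{L \in T} \mathit{Rule}_L$$

Finally, the transducer for a forest F of trees is
just:

$$\mathit{Rule}_F = \bigcap_{T \in F} \mathit{Rule}_T$$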
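To make the bookkeeping in the leaf-rule definition concrete, here is a purely symbolic sketch that assembles the rule for leaf node 4 from the branch constraints in Table 3. It does not compile anything: the Compile function is assumed to exist elsewhere, the context expressions are kept as plain strings, and `&` stands in for intersection. The names `BRANCH`, `OUTPUTS`, `path_to` and `leaf_rule` are hypothetical.

```python
PHI = "aa"  # input symbol for the whole tree

# (lambda, rho) constraint contributed by the branch leading INTO each node.
# Only the two branches on the path to leaf 4 are listed.
BRANCH = {
    2: ("#Opt(')", "Sigma*"),      # root -> left daughter (word-initial)
    4: ("Sigma*", "Opt(')(alv)"),  # node 2 -> left daughter (alveolar follows)
}

# Weighted outputs at leaf 4 (negative log probabilities from the text).
OUTPUTS = {
    4: {"ao": 0.95, "aa": 1.24, "q+aa": 2.27, "q+ao": 2.34, "ah": 2.68, "ax": 2.84},
}


def path_to(leaf):
    """Nodes entered on the way from the root (node 1) to `leaf`; parent of n is n // 2."""
    path = []
    node = leaf
    while node > 1:
        path.append(node)
        node //= 2
    return list(reversed(path))


def leaf_rule(leaf):
    """Return the rule for `leaf` as a readable string (symbolic, not compiled)."""
    lam = " & ".join(BRANCH[n][0] for n in path_to(leaf))
    rho = " & ".join(BRANCH[n][1] for n in path_to(leaf))
    psi = " U ".join(f"{out}<{w:.2f}>" for out, w in OUTPUTS[leaf].items())
    return f"{PHI} -> ({psi}) / {lam} __ {rho}"


print(leaf_rule(4))
```

Walking upward with `n // 2` relies only on the 2n / 2n + 1 numbering convention, so no explicit parent pointers are needed.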



5 Empirical Verification of the 
Method. 

The algorithm just described has been empiri- 
cally verified on the Resource Management (RM) 
continuous speech recognition task (Price et al., 
1988). Following somewhat the discussion in 
(Pereira et al., 1994; Pereira and Riley, 1996), 
we can represent the speech recognition task as 
the problem of finding the best path in the com- 
position of a grammar (language model) G, the 
transitive-closure of a dictionary D mapping be- 
tween words and their phonemic representation, 
a model of phone realization $\Phi$, and a weighted
lattice representing the acoustic observations A.
Thus:

$$\mathrm{BestPath}(G \circ D^{*} \circ \Phi \circ A) \tag{1}$$

The transducer $\Phi = \bigcap_{T \in F} \mathit{Rule}_T$ can be constructed
out of the forest F of 40 trees, one for
each phoneme, trained on the TIMIT database.
The sizes of the trees range from 1 to 23 leaf nodes,
with a total of 291 leaves for the entire forest.
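Since the weights are negative log probabilities, the best path through a composed machine is the one with the smallest summed weight. The sketch below shows that search in isolation, using Dijkstra's algorithm on a tiny hand-made lattice. The composition itself is not implemented, and the lattice, its arc weights for "z" and "s", and the function name `best_path` are all invented for illustration.

```python
import heapq


def best_path(arcs, start, goal):
    """Lowest-total-weight path; arcs maps state -> [(next_state, label, weight)].

    Assumes non-negative weights, which holds for negative log probabilities.
    """
    queue = [(0.0, start, [])]
    done = set()
    while queue:
        cost, state, labels = heapq.heappop(queue)
        if state == goal:
            return cost, labels
        if state in done:
            continue
        done.add(state)
        for nxt, label, weight in arcs.get(state, []):
            if nxt not in done:
                heapq.heappush(queue, (cost + weight, nxt, labels + [label]))
    return None


# Hypothetical three-state lattice: two realizations of /aa/, then two of /z/.
LATTICE = {
    0: [(1, "ao", 0.95), (1, "aa", 1.24)],
    1: [(2, "z", 0.10), (2, "s", 1.50)],
}

cost, labels = best_path(LATTICE, 0, 2)
print(labels, round(cost, 2))
```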

The model was tested on 300 sentences from 
the RM task containing 2560 word tokens, and 
approximately 10,500 phonemes. A version of the 
model of recognition given in expression (1), where 
$\Phi$ is a transducer computed from the trees, was
compared with a version where the trees were used 
directly following a method described in (Ljolje 
and Riley, 1992). The phonetic realizations and 
their weights were identical for both methods, thus 
verifying the correctness of the compilation algo- 
rithm described here. 

The sizes of the compiled transducers can be 
quite large; in fact they were sufficiently large that 
instead of constructing $\Phi$ beforehand, we intersected
the 40 individual transducers with the lattice
$D^{*}$ at runtime. Table 4 gives sizes for the
entire set of phone trees: tree sizes are listed in 
terms of number of rules (terminal nodes) and raw 
size in bytes; transducer sizes are listed in terms 
of number of states and arcs. Note that the entire 
alphabet comprises 215 symbol pairs. Also given 
in Table 4 are the compilation times for the indi- 
vidual trees on a Silicon Graphics R4400 machine 
running at 150 MHz with 1024 Mbytes of memory. 
The times are somewhat slow for the larger trees, 
but still acceptable for off-line compilation. 

While the sizes of the resulting transducers 
seem at first glance to be unfavorable, it is im- 
portant to bear in mind that size is not the only 
consideration in deciding upon a particular repre- 
sentation. WFSTs possess several nice properties 
that are not shared by trees, or handwritten rule- 
sets for that matter. In particular, once compiled 
into a WFST, a tree can be used in the same way 
as a WFST derived from any other source, such as 
a lexicon or a language model; a compiled WFST 
can be used directly in a speech recognition model 
such as that of (Pereira and Riley, 1996) or in a 
speech synthesis text-analysis model such as that 
of (Sproat, 1996). Use of a tree directly requires 
a special-purpose interpreter, which is much less 
flexible. 

It should also be borne in mind that the size 
explosion evident in Table 4 also characterizes 
transducers that are compiled from hand-built rewrite
rules (Kaplan and Kay, 1994; Mohri and Sproat, 
1996). For example, the text-analysis ruleset for 



the Bell Labs German text-to-speech (TTS) sys- 
tem (see (Sproat, 1996; Mohri and Sproat, 1996)) 
contains sets of rules for the pronunciation of var- 
ious orthographic symbols. The ruleset for <a>, 
for example, contains 25 ordered rewrite rules. 
Over an alphabet of 194 symbols, this compiles, 
using the algorithm of (Mohri and Sproat, 1996), 
into a transducer containing 213,408 arcs and 
1,927 states. This is 72% as many arcs and 48% 
as many states as the transducer for /ah/ in Ta- 
ble 4. The size explosion is not quite as great here, 
but the resulting transducer is still large compared 
to the original rule file, which only requires 1428 
bytes of storage. Again, the advantages of rep- 
resenting the rules as a transducer outweigh the 
problems of size. 5 

6 Future Applications 

We have presented a practical algorithm for con- 
verting decision trees inferred from data into 
weighted finite-state transducers that directly im- 
plement the models implicit in the trees, and we 
have empirically verified that the algorithm is cor- 
rect. 

Several interesting areas of application come 
to mind. In addition to speech recognition, where 
we hope to apply the phonetic realization models 
described above to the much larger North Amer- 
ican Business task (Paul and Baker, 1992), there 
are also applications to TTS where, for example, 
the decision trees for prosodic phrase-boundary 
prediction discussed in (Wang and Hirschberg, 
1992) can be compiled into transducers and used 
directly in the WFST-based model of text analysis 
used in the multi-lingual version of the Bell Lab- 
oratories TTS system, described in (Sproat, 1995; 
Sproat, 1996). 
[Table 4: body lost in extraction. Surviving column headers: ARPABET phone; # nodes; size of tree (bytes); time (sec).]

Table 4: Sizes of transducers corresponding to each of the individual phone trees.